---
title: "Online appendix: You’ve been shadowbanned: Has Facebook's strategy to suppress rather than remove COVID-19 vaccine misinformation actually slowed the spread? (v1)"
author: "Francesco Bailo"
date: '2023-11-24'
output:
  bookdown::pdf_document2:
    toc: true
    keep_tex: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, include = FALSE, message = FALSE, warning = FALSE, cache = T)

library(tidyverse)

ggplot2::theme_set(theme_bw())

library(cowplot)

library(knitr)

library(kableExtra)

```

\clearpage

# Meta News Room and Integrity Timeline

\begin{table}[!h]
\footnotesize
\caption{Meta's COVID-19 content moderation policy announcements}
\begin{tabular}{llp{6cm}p{7cm}}
\textbf{Date} & \textbf{Source} & \textbf{Link}                                                               & \textbf{Summary}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            \\
Mar-20                       & Meta News Room                 & \url{https://about.fb.com/news/2020/03/combating-covid-19-misinformation/}                       & "Aggressive" steps being taken to stop misinformation from spreading. This includes educational pop-ups from WHO, the CDC and regional health authorities when people search for COVID related info; launched COVID-19 Information Centre; removal of COVID-19 misinformation leading to imminent physical harm (i.e. false claims about vaccines, treatments and availability, location and severity of outbreak); for claims not leading to physical harm (conspiracy theories concerning origin of virus) we reduce distribution, remove accounts from recommendations. \\
Jun-20                       & Meta News Room                 & \url{https://about.fb.com/news/2020/04/covid-19-misinfo-update/}                                 & To increase disincentives for users to share misinformation, once a piece of COVID related content is rated false by fact-checkers, we reduce its distribution and show warning labels with more context.                                                                                                                                                                                                                                                                                                                                                                  \\
Aug-20                       & Meta News Room                 & \url{https://about.fb.com/news/2020/08/addressing-movements-and-organizations-tied-to-violence/} & QANON is listed as a dangerous organisation and efforts made to restrict them organising on fb such as: removal if they make violent claims, restrictions  distributing content from their accounts, downranking accounts and content in the newsfeed and search, limit recommendations, prohibiting the organisations use of ads, and selling products in marketplace, prohibit them using fundraising tools.                                                                                                                                                             \\
Aug-20                       & Meta News Room                 & https://about.fb.com/news/2020/06/more-context-for-news-articles-and-other-content/        & To help people ascertain the legitimacy of news articles and to combat misinformation, Meta announced they would add a "context button" to news items shared to notify users if the contents are more than 90 days old.                                                                                                                                                                                                                                                                                                                                                    \\
Oct-20                       & Meta News Room                 & \url{https://about.fb.com/news/2020/08/addressing-movements-and-organizations-tied-to-violence/} & Update to Dangerous Organisations policy: Meta announces that if someone searches Facebook using a term related to Qanon on Facebook, a label will cover the content and they will be redirected to the Global Network on Extremism and Technology (GNET) initiative to combat violent extremism.                                                                                                                                                                                                                                                                          \\
Feb-21                       & Meta News Room                 & \url{https://about.fb.com/news/2020/04/covid-19-misinfo-update/}                                 & Update on the claims that would now be removed after consultation with WHO and fact-checking network: COVID-19 is man made or manufactured; vaccines are not effective in preventing disease; its safer to get the disease than a vaccine; vaccines are toxic, cause harm or autism.                                                                                                                                                                                                                                                                                      
\end{tabular}
\end{table}

\clearpage

# Data pipeline and data summary

## Replication notes

To replicate the analysis, use the R code in the source file of this document. Raw CrowdTangle data containing identifiable information (i.e. account names and messages) is available only upon request. However, de-identified data is available for direct download and can be used to replicate the data analysis presented in this document.

## Raw CrowdTangle data

Social media data was obtained from CrowdTangle by requesting the historical archive of all Facebook accounts added to two lists (one for Facebook groups and one for Facebook pages) compiled by the research team.  

```{r vaccine_regex}

vaccine_regex <- 
  paste0("covid|vaccin|pandemic|plandemic|lockdown|mandatory|",
         "coronavirus|virus|jab|mask|CV-19|informed choice|bill gates|vaxeen")

```


```{r read-csv-ct, eval = F}

all_dat <-
  dplyr::bind_rows(
    dplyr::bind_rows(
      read.csv("data/2021-06-17-09-39-46-AEST-Historical-Report-Anti-Vax-2018-12-31--2021-06-17.csv",
               stringsAsFactors = F) %>%
        dplyr::mutate(Likes.at.Posting = as.numeric(Likes.at.Posting),
                      Followers.at.Posting = as.numeric(Followers.at.Posting),
                      PageGroup.Name = Page.Name,
                      posix = as.POSIXct(Post.Created),
                      date = as.Date(Post.Created),
                      Overperforming.Score = as.numeric(Overperforming.Score..weighted.....Likes.1x.Shares.1x.Comments.1x.Love.1x.Wow.1x.Haha.1x.Sad.1x.Angry.1x.Care.1x..),
                      account_type = "Facebook page") %>%
        dplyr::select(PageGroup.Name, Likes.at.Posting, Followers.at.Posting, date, posix, Shares, Overperforming.Score, account_type, Message),
      read.csv("data/2021-11-09-10-29-58-AEDT-Historical-Report-Multiple-Pages-2021-05-31--2021-07-01.csv",
               stringsAsFactors = F) %>%
        dplyr::mutate(Likes.at.Posting = as.numeric(Likes.at.Posting),
                      Followers.at.Posting = as.numeric(Followers.at.Posting),
                      PageGroup.Name = Page.Name,
                      posix = as.POSIXct(Post.Created),
                      date = as.Date(Post.Created),
                      Overperforming.Score = as.numeric(Overperforming.Score..weighted.....Likes.1x.Shares.1x.Comments.1x.Love.1x.Wow.1x.Haha.1x.Sad.1x.Angry.1x.Care.1x..),
                      account_type = "Facebook page") %>%
        dplyr::select(PageGroup.Name, Likes.at.Posting, Followers.at.Posting, date, posix, Shares, Overperforming.Score, account_type, Message)),
    read.csv("data/2021-06-17-09-30-50-AEST-Historical-Report-Main-project-list---groups-2018-12-31--2021-06-17.csv",
             stringsAsFactors = F) %>%
      dplyr::mutate(Likes.at.Posting = as.numeric(Likes.at.Posting),
                    Followers.at.Posting = as.numeric(Followers.at.Posting),
                    PageGroup.Name = Group.Name,
                    posix = as.POSIXct(Post.Created),
                    date = as.Date(Post.Created),
                    Overperforming.Score = as.numeric(Overperforming.Score..weighted.....Likes.1x.Shares.1x.Comments.1x.Love.1x.Wow.1x.Haha.1x.Sad.1x.Angry.1x.Care.1x..),
                    account_type = "Facebook group") %>%
      dplyr::select(PageGroup.Name, Likes.at.Posting, Followers.at.Posting, date, posix, Shares, Overperforming.Score, account_type, Message)) %>%
  dplyr::mutate(Likes.at.Posting = as.numeric(Likes.at.Posting),
                Followers.at.Posting = as.numeric(Followers.at.Posting),
                month = as.Date(format(date, format = "%Y-%m-15")),
                month_fac = factor(month, ordered = T, 
                                   levels = as.character(seq(from = min(month), 
                                                             to = max(month),
                                                             by = "month")))) %>%
  dplyr::mutate(vax_regex = grepl(vaccine_regex, Message)) %>%
  dplyr::select(-Message)

```

```{r deidentification-data, eval = F}

all_dat <-
  all_dat %>%
  dplyr::mutate(`internal id` = as.numeric(factor(PageGroup.Name)),
                PageGroup.deidentified = paste0("De-identified ", account_type))

all_dat %>%
  dplyr::distinct(`internal id`, PageGroup.Name) %>%
  readr::write_csv(file = "data/account_name_and_id.csv")
 
all_dat <-
  all_dat %>% 
  dplyr::select(-PageGroup.Name)

save(all_dat, file = "data/all_dat_deid.RData")

```

```{r load-data}

# To replicate analysis, begin here.

load("data/all_dat_deid.RData")

```

\clearpage

```{r table1, include = TRUE}

all_dat %>%
  dplyr::distinct(`internal id`, account_type) %>%
  dplyr::group_by(account_type) %>%
  dplyr::count() %>%
  knitr::kable(caption = "Type and number of Facebook accounts monitored through CrowdTangle", 
               col.names = c("account type", "n"), booktabs = T, linesep = "") %>%
  kable_styling(latex_options = "striped")

```

```{r table2, include = TRUE}

all_dat %>%
  dplyr::group_by() %>%
  dplyr::summarize(`postings` = format(n(), big.mark = ","),
                   `first posting` = min(posix),
                   `last posting` = max(posix)) %>%
  knitr::kable(caption = "Number of postings and time frame of publication", 
               booktabs = T, linesep = "")


```

```{r figure1, fig.cap="Number of daily postings published by the accounts", include = TRUE}

all_dat %>%
  dplyr::group_by(date) %>%
  dplyr::count() %>%
  ggplot(aes(x = date, y = n)) +
  geom_bar(stat = 'identity')

```

\pagebreak

## Text analysis of the postings' message to identify vaccination-related content

We classified a social media message as discussing vaccination when its text matched the regular expression ``r vaccine_regex``.
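
The classification rule can be illustrated with a minimal sketch. The pattern is copied from the `vaccine_regex` chunk in the source file; the four messages are hypothetical and invented for illustration only. The block is shown for reference and is not evaluated when the document is knitted. As written in the source, the `grepl()` call passes no `ignore.case` argument, so matching is case-sensitive; the second call below is only an assumption-labelled comparison, not part of the analysis.

```r
# Same pattern as in the `vaccine_regex` chunk of the source file.
vaccine_regex <-
  paste0("covid|vaccin|pandemic|plandemic|lockdown|mandatory|",
         "coronavirus|virus|jab|mask|CV-19|informed choice|bill gates|vaxeen")

# Hypothetical messages, invented for illustration only.
messages <- c("No jab, no pay is wrong",
              "Council meeting on Tuesday",
              "COVID rules change again",
              "the covid rules change again")

# Matching as in the source chunk: grepl() is case-sensitive by default.
matched_as_written <- grepl(vaccine_regex, messages)

# For comparison only (not used in the analysis): case-insensitive matching.
matched_ignoring_case <- grepl(vaccine_regex, messages, ignore.case = TRUE)
```

In this toy input the upper-case "COVID" message is matched only by the case-insensitive variant, which shows how the choice of matching mode can affect the share of postings flagged as vaccination-related.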

```{r figure2, fig.cap="Proportion of daily postings with vaccination-related content", include = TRUE}

all_dat %>%
  dplyr::group_by(date) %>%
  dplyr::summarize(perc = sum(vax_regex) / n()) %>%
  ggplot(aes(x = date, y = perc)) +
  scale_y_continuous(label = scales::percent) +
  geom_bar(stat = 'identity')

```

\pagebreak

## Data selection based on vaccination-related content

```{r table3, include = T}

all_dat %>%
  dplyr::group_by(`internal id`,
                  `Facebook account (de-id)` = PageGroup.deidentified) %>%
  dplyr::summarize(posts = format(n(), big.mark = ","),
                   `discussing vaccines/covid` = paste0(round(sum(vax_regex) / n() * 100, 2), "%"),
                   `Likes at Posting (max)` = max(Likes.at.Posting, na.rm = T),
                   `Followers at Posting (max)` = format(max(Followers.at.Posting, na.rm = T), big.mark = ","),
                   `first post` = format(min(date), "%d %b %Y"),
                   `last post` = format(max(date), "%d %b %Y")) %>%
  dplyr::arrange(`internal id`) %>%
  dplyr::mutate(`Likes at Posting (max)` = format(`Likes at Posting (max)`, big.mark = ",")) %>%
  dplyr::ungroup() %>%
  dplyr::mutate() %>%
  knitr::kable(caption = "Detailed statistics for each monitored Facebook account", 
               booktabs = T, linesep = "") %>%
  kable_styling(latex_options = c("striped", "scale_down", "hold_position"))

```

We excluded from the analysis accounts for which the proportion of postings with vaccination-related terms over the total number of postings was 1% or lower. 

```{r}

accounts_to_include <-
  all_dat %>%
  dplyr::group_by(`internal id`) %>%
  dplyr::summarize(vax_regex = (sum(vax_regex) / n())) %>%
  dplyr::filter(vax_regex > .01)

all_dat <- 
  all_dat %>%
  dplyr::filter(`internal id` %in% accounts_to_include$`internal id`)

```
 
 
We also excluded all posts published on or after 17 June 2021, as we did not collect data for all accounts from that date. 
 
```{r}

all_dat <- 
  all_dat %>%
  dplyr::filter(date < as.Date("2021-06-17"))

```
 
 
This left `r nrow(accounts_to_include)` accounts with `r format(nrow(all_dat), big.mark = ",")` postings for the analysis. 
 
\pagebreak
 
## Summary statistics for posts from the `r nrow(accounts_to_include)` selected accounts

```{r table4, include = TRUE}
all_dat %>%
  dplyr::group_by(`internal id`,
                  `Facebook account (de-id)` = PageGroup.deidentified) %>%
  dplyr::summarise(posts = n(),
                   post_day = n() / as.numeric((max(date) - min(date))),
                   Likes.at.Posting = median(Likes.at.Posting, na.rm = T)) %>%
  dplyr::ungroup() %>%
  dplyr::group_by() %>%
  dplyr::summarise(`postings a day (min.)` = min(post_day),
                   `postings a day (max.)` = max(post_day),
                   `postings a day (mean)` = mean(post_day),
                   `postings a day (s.d.)` = sd(post_day),
                   `likes at posting (min. of median)` = min(Likes.at.Posting, na.rm = T),
                   `likes at posting (max. of median)` = format(max(Likes.at.Posting, na.rm = T), 
                                                      big.mark = ","))  %>%
  knitr::kable(caption = "Summary statistics on postings across accounts", 
               booktabs = T, linesep = "") %>%
  kable_styling(latex_options = c("striped", "scale_down", "hold_position"))

```


```{r table5, include = TRUE}

all_dat %>%
  dplyr::group_by(`Facebook account` = `internal id`,
                  `Facebook account (de-id)` = PageGroup.deidentified) %>%
  dplyr::summarise(Likes.at.Posting = sd(Likes.at.Posting, na.rm = T)) %>%
  dplyr::ungroup() %>%
  dplyr::group_by() %>%
  dplyr::summarize(`likes at posting (min. of s.d.)` = min(Likes.at.Posting), 
                   `likes at posting (max. of s.d.)` =  format(max(Likes.at.Posting), 
                          big.mark = ",")) %>%
  knitr::kable(caption = "Summary statistics on postings for the entire sample", 
               booktabs = T, linesep = "") %>%
  kable_styling(latex_options = c("striped", "hold_position"))

```
 
\pagebreak
 
## Performance of tracked pages (2019-2021)

```{r}

benchmarking_period <-
  c(as.Date("2020-01-01"), 
    as.Date("2020-12-31"))

```

We set an arbitrary benchmarking period from `r benchmarking_period[1]` to `r benchmarking_period[2]` so as to have a suitable benchmark for all accounts (a few accounts were created in late 2020).  

The benchmark was defined as the average number of shares per post received by each account on postings published between `r benchmarking_period[1]` and `r benchmarking_period[2]`, after excluding, for each account, the posts in the bottom 2.5% and the top 2.5% by number of shares.
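
As a minimal sketch of this trimming step, the block below computes the benchmark for a single account. The share counts are hypothetical and the variable names (`shares`, `lower`, `upper`, `kept`, `benchmark`) are introduced here for illustration; the block is shown for reference and is not evaluated when the document is knitted.

```r
# Hypothetical share counts for one account during the benchmarking period.
shares <- c(0, 1, 2, 2, 3, 3, 4, 5, 6, 250)

# Percentile cut-offs (R's default quantile type), as in the benchmark chunk.
lower <- quantile(shares, probs = .025, na.rm = TRUE)
upper <- quantile(shares, probs = .975, na.rm = TRUE)

# Keep posts between the two cut-offs (inclusive), then average their shares.
kept <- shares[shares >= lower & shares <= upper]
benchmark <- mean(kept)

# Untrimmed mean, for comparison only.
untrimmed <- mean(shares)
```

With these toy values the single viral post (250 shares) and the zero-share post are dropped, so the benchmark is 3.25 rather than the untrimmed mean of 27.6. This is the reason for trimming: one outlier would otherwise dominate an account's baseline.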

```{r}

all_dat.benchmark <- 
  all_dat %>%
  dplyr::group_by(`internal id`) %>%
  dplyr::filter(date >= benchmarking_period[1] &
                  date <= benchmarking_period[2]) %>%
  dplyr::filter(Shares >= quantile(Shares, p = .025, na.rm = T),
                Shares <= quantile(Shares, p = .975, na.rm = T)) %>%
  dplyr::group_by(`internal id`) %>%
  dplyr::summarize(Shares = mean(Shares))

```

```{r include = TRUE}

all_dat.benchmark %>%
  knitr::kable(caption = "Mean of shares for 95 perc. of postings published in the benchmarking period", 
               booktabs = T, linesep = "") %>%
  kable_styling(latex_options = c("striped", "hold_position"))

```

```{r }

date_seq <- 
  seq(from = min(all_dat$date), 
      to = as.Date("2021-06-16"),
      by = "day")

meanFun <- function(x) {
  x <- x[ x >= quantile(x, p =.025, na.rm = T) &
            x <= quantile(x, p =.975, na.rm = T) ]
  return(mean(x))
}

all_dat.performance <- 
  data.frame()

for (i in 15:(length(date_seq)-15)) {
  
  this_dat <-
    all_dat %>%
    dplyr::filter(date >= date_seq[i],
                  date <= date_seq[i+14]) %>%
    dplyr::mutate(date = date_seq[i])
  
  this_dat <- 
    this_dat %>%
    dplyr::mutate(bench = all_dat.benchmark$Shares[match(`internal id`, 
                                       all_dat.benchmark$`internal id`)]) %>%
    dplyr::filter(!is.na(bench)) %>%
    dplyr::group_by(`internal id`) %>%
    dplyr::summarize(performance = 
                       (meanFun(Shares) / bench[1]) - 1,
                     weight = median(Shares),
                     .groups = 'keep')
  
  all_dat.performance <- 
    rbind(all_dat.performance,
          this_dat %>% mutate(date = date_seq[i]))
  
}

```
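
Read from the loop in the source file, the share performance of account $a$ on day $t$ can be written as follows. The symbols are notation introduced here for exposition and do not appear in the code.

$$
\mathrm{performance}_{a,t} = \frac{\bar{s}_{a,[t,\,t+14]}}{b_a} - 1
$$

Here $\bar{s}_{a,[t,\,t+14]}$ is the mean number of shares of the posts published by account $a$ from day $t$ to day $t+14$, after dropping posts below the 2.5th or above the 97.5th percentile of shares within that window, and $b_a$ is the benchmark of account $a$ defined above. A value of 0 therefore means the account performs on par with its benchmark, and a value of 1 (100%) means its posts receive twice as many shares as in the benchmarking period.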

```{r figure3, include = TRUE, fig.cap = 'Share performance of each account as proportion of the benchmark period', fig.width = 12}

ggplot(all_dat.performance, aes(x = date, y = performance)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  facet_wrap(`internal id`~., scales = "free_y") +
  labs(x = NULL) + 
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(angle = 45))

```

\clearpage

```{r figure4, include = T, fig.width = 11, fig.height = 4, fig.cap = "Daily performance in terms of posts' shares of the 18 Facebook accounts between January 2019 and June 2021 as proportion of their average performance in the period January-December 2020"}

all_dat.median <- 
  all_dat.performance %>%
  dplyr::group_by(date) %>%
  dplyr::summarize(low = quantile(performance, p = .1, na.rm = T),
                   upp = quantile(performance, p = .9, na.rm = T),
                   median_performance = median(performance, na.rm = T),
                   mean_performance = mean(performance, na.rm = T))

ts.p <- 
  all_dat.median %>%
  ggplot(aes(x = date)) +
  geom_ribbon(aes(ymin = low, ymax = upp, fill = "grey70")) +
  geom_line(aes(y = median_performance, linetype = "1")) +
  geom_line(aes(y = mean_performance, linetype = "2"), alpha = .8) +
  geom_hline(yintercept = 0, linetype = 1, size = .2) +
  geom_vline(xintercept = c(
    as.Date("2020-03-25"), # https://about.fb.com/news/2020/03/combating-covid-19-misinformation/
    as.Date("2020-06-25"), # https://about.fb.com/news/2020/06/more-context-for-news-articles-and-other-content/
    as.Date("2020-08-19"), # https://about.fb.com/news/2020/08/addressing-movements-and-organizations-tied-to-violence/
    as.Date("2020-10-21"), # https://about.fb.com/news/2020/08/addressing-movements-and-organizations-tied-to-violence/
    as.Date("2021-02-26")
  ), 
  colour = "black", 
  linetype = 1, size = .2) +
  geom_label(data = 
               data.frame(date = 
                            c(
                              as.Date("2020-03-25"),
                              as.Date("2020-06-25"), 
                              as.Date("2020-08-19"),
                              as.Date("2020-10-21"),
                              as.Date("2021-02-26")
                              ),
                          text = c("1",
                                   "2", 
                                   "3",
                                   "4",
                                   "5"),
                          y = c(8,8,8,8,8)),
             aes(x = date, y = y, label = text), colour = "black") +
  coord_cartesian(ylim = c(NA, NA)) +
  scale_y_continuous(labels = scales::percent, trans=scales::pseudo_log_trans(base = 10),
                     breaks = c(-1, 0, 1, 2, 4, 8, 12)) +
  labs(x = NULL, y = "performance relative to baseline") +
  scale_fill_identity(name = NULL, guide = 'legend', labels = c('80% distribution')) +
  scale_linetype_manual(name = NULL, 
                        values =c('1'=1,'2'=2), labels = c('median','mean')) +
  # theme_bw() goes first: a complete theme resets earlier theme() settings
  theme_bw() +
  theme(legend.position = c(.2, .8))

ggsave(ts.p, width =11, height = 4, 
       filename = "img/ts.p-median-20.eps")

ts.p

```

Events marked in the figure: (1) Facebook vows more "aggressive action" on COVID-19 and vaccine misinformation; (2) labels are added to content to show users the source of information; (3) Facebook expands its list of “dangerous organisations and individuals” to include QAnon and other vaccine-critical organisations, and threatens to shadowban any user or page that supports QAnon-related content; (4) building on the “dangerous organisations and individuals” policy, users and pages sharing support for QAnon are labelled, and users encountering labelled content are redirected to counsellors; (5) Facebook claims it will remove groups, pages and accounts that keep offending.

```{r}

plotFun <- function(page_name, title = NULL) {
  
  theme_set(theme_bw())
  
  this_dat <- 
    all_dat %>%
    dplyr::filter(`internal id` == page_name)
  
  minmax <- 
    c(min(this_dat$date), max(this_dat$date))
  
  cowplot::plot_grid(
    this_dat %>%
      dplyr::group_by(date) %>%
      dplyr::count() %>%
      ggplot(aes(x=date, y = n, group = date)) +
      geom_bar(stat = 'identity') + 
      scale_x_date(limits = minmax) +
      labs(y = NULL, x = 'posts', title = title),
    this_dat %>%
      ggplot(aes(x=posix, y = Followers.at.Posting)) +
      geom_point(size = .5) + 
      scale_x_datetime(limits = as.POSIXct(minmax)) +
      scale_y_continuous(labels = function(x) format(x, big.mark = ",")) +
      labs(y = NULL, x = 'followers'),
    
    this_dat %>%
      dplyr::filter(Shares <= quantile(Shares, p = .9, na.rm = T) &
                      Shares >= quantile(Shares, p = .1, na.rm = T)) %>%
      dplyr::mutate(month_fac = gsub("-15", "", month_fac)) %>%
      ggplot(aes(x=month, y = Shares, group = month)) +
      geom_boxplot()  + 
      scale_x_date(limits = minmax) + 
      geom_hline(yintercept = 0, size = .2) +
      labs(y = NULL, x = "shares"),
    
    this_dat %>%
      dplyr::filter(Overperforming.Score <= quantile(Overperforming.Score, p = .9, na.rm = T) &
                      Overperforming.Score >= quantile(Overperforming.Score, p = .1, na.rm = T)) %>%
      dplyr::mutate(month_fac = gsub("-15", "", month_fac)) %>%
      ggplot(aes(x=month, y = Overperforming.Score, group = month)) +
      geom_boxplot()  + 
      scale_x_date(limits = minmax) + 
      geom_hline(yintercept = 0, size = .2) +
      scale_y_continuous(labels = scales::percent) +
      labs(y = NULL, x = "CrowdTangle's overperforming score"),
    align = "v", 
    ncol = 1)
  
}

```

```{r figure5, fig.cap = "Posting activity and engagement metrics for a de-identified Facebook page", fig.width = 9, fig.height = 7, include = TRUE}

ts.page.1 <- 
  plotFun(4, 'De-identified Facebook page')

ggsave(ts.page.1,
       filename = "img/ts-page-1.eps",
       width = 9, height = 7)

ts.page.1

```


```{r figure6, fig.cap = "Posting activity and engagement metrics for Informed Medical Options Party's Facebook page", fig.width = 9, fig.height = 7, include = TRUE}

ts.page.2 <- 
  plotFun(11, "Informed Medical Options Party's Facebook page")

ggsave(ts.page.2,
       filename = "img/ts-page-2.eps",
       width = 9, height = 7)

ts.page.2

```

\newpage

## Topic modelling (with Latent Dirichlet Allocation)


```{r}
library(readr)
library(stringr)
require(tidyverse)

facebook_posts_with_topics <- 
  read_delim("data/facebook_posts_with_topics.tsv",
           delim = "\t", escape_double = FALSE, 
           trim_ws = TRUE)

facebook_comments_with_topics <- 
  read_delim("data/facebook_comments_with_topics.tsv",
             delim = "\t", escape_double = FALSE, 
             trim_ws = TRUE)

facebook_posts_with_topics$first_topic <- 
  as.numeric(str_extract(facebook_posts_with_topics$topic_distribution, "[0-9]{1,2}")) + 1

facebook_comments_with_topics$first_topic <- 
  as.numeric(str_extract(facebook_comments_with_topics$topic_distribution, "[0-9]{1,2}")) + 1

```


```{r}
require(tidyr)
require(tidytext)
require(stringr)
require(corrplot)
require(igraph)

corpus.df <- 
  facebook_posts_with_topics %>%
  dplyr::mutate(doc_id = paste0("post_", `...1`),
                text = post_text,
                topic_1 = as.numeric(str_extract(topic_distribution, "[0-9]{1,2}")) + 1,
                prob_1 = as.numeric(str_extract(topic_distribution, "0\\.[0-9]+"))) %>%
  dplyr::select(doc_id:prob_1, date) %>%
  dplyr::bind_rows(facebook_comments_with_topics %>%
                     dplyr::mutate(doc_id = paste0("comment_", `...1`),
                                   text = comment_text,
                                 topic_1 = as.numeric(str_extract(topic_distribution, "[0-9]{1,2}")) + 1,
                                   prob_1 = as.numeric(str_extract(topic_distribution, "0\\.[0-9]+"))) %>%
                     dplyr::select(doc_id:prob_1, date))

corpus.tidy <- 
  corpus.df %>%
  unnest_tokens(word, text)

corpus.tfidf <- 
  corpus.df %>%
  unnest_tokens(word, text) %>%
  dplyr::filter(!word %in% c("s", "t", "https", "it", "www", "com", "i", "au")) %>%
  dplyr::group_by(topic_1, doc_id, word) %>%
  dplyr::count() %>%
  dplyr::ungroup() %>%
  dplyr::group_by(topic_1, word) %>%
  dplyr::count() %>%
  # dplyr::filter(n > 5) %>%
  bind_tf_idf(word, topic_1, n)
```

For exploratory purposes, we fitted an LDA model with 26 topics to 107 threads, from which we collected 2,842 comments. For each thread, we preprocessed the text of posts and comments by removing stop words and punctuation. This resulted in a corpus of `r nrow(corpus.df)` documents: `r sum(grepl( "post", corpus.df$doc_id))` posts and `r sum(grepl( "comment", corpus.df$doc_id))` comments. (The missing three posts and eight comments did not contain any residual text after preprocessing.) 
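
The way each document is assigned to a single topic can be sketched as follows. The example string is hypothetical: the actual format of the `topic_distribution` column is not shown in this document, and the sketch assumes a list of (topic index, probability) pairs with the most probable topic listed first and topic indices starting at zero. It uses base R equivalents of the `str_extract()` calls in the source file, and is shown for reference only (it is not evaluated when the document is knitted).

```r
# Hypothetical value of `topic_distribution` (format assumed, not confirmed).
topic_distribution <- "[(3, 0.62), (17, 0.21), (0, 0.05)]"

# Helper (hypothetical name): return the first match of a pattern in a string.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x))
}

# First one- or two-digit integer: the leading topic index, shifted by one
# so that topics are numbered from 1.
topic_1 <- as.numeric(first_match("[0-9]{1,2}", topic_distribution)) + 1

# First number of the form "0.xxx": the probability of that leading topic.
prob_1 <- as.numeric(first_match("0\\.[0-9]+", topic_distribution))
```

Under these assumptions the example document is assigned to topic 4 with probability 0.62. Note that the second pattern would not capture a probability written as exactly `1.0`.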

\newpage

```{r include = TRUE}
table(corpus.df$topic_1) %>%
    knitr::kable(caption = "Distribution of documents per topic (based on highest probability)", 
               col.names = c("topic", "n documents"), booktabs = T, linesep = "") %>%
  kable_styling(latex_options = "striped")
```


\newpage

```{r include = TRUE}
corpus.tfidf %>%
  dplyr::group_by(topic_1) %>%
  dplyr::top_n(10, wt = tf_idf) %>%
      knitr::kable(caption = "The 10 terms with the highest tf-idf for each topic", 
               col.names = c("topic", "term", "occurrences", "tf", "idf", "tf-idf"), booktabs = T, linesep = "", longtable = T) %>%
  kable_styling(latex_options = "striped")
```
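
For readers unfamiliar with tf-idf, the following minimal sketch reproduces the calculation by hand in base R, treating each topic as a "document" as in the table above. The counts and words are hypothetical, and the column names are introduced here for illustration. The formulas follow the usual definition (term frequency times the natural log of the inverse document frequency), which is our understanding of what `tidytext::bind_tf_idf()` computes. The block is not evaluated when the document is knitted.

```r
# Hypothetical counts: number of documents in each topic containing each word.
counts <- data.frame(
  topic = c(1, 1, 2, 2),
  word  = c("vaccine", "mandate", "vaccine", "lockdown"),
  n     = c(8, 2, 5, 5)
)

# Term frequency: each word's share of its topic's total count.
counts$tf <- counts$n / ave(counts$n, counts$topic, FUN = sum)

# Inverse document frequency: log(number of topics / topics containing the word).
n_topics <- length(unique(counts$topic))
topics_with_word <- ave(counts$topic, counts$word,
                        FUN = function(x) length(unique(x)))
counts$idf <- log(n_topics / topics_with_word)

# tf-idf: high for words frequent in one topic but rare across topics.
counts$tf_idf <- counts$tf * counts$idf
```

In this toy example "vaccine" appears in both topics and so receives a tf-idf of zero, while "lockdown" and "mandate" each appear in only one topic and receive positive scores. This is why the table highlights terms that distinguish a topic rather than terms that are merely frequent.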

